A computational model of inner speech supporting flexible goal-directed behaviour in Autism

Experimental and computational studies propose that inner speech boosts categorisation skills and executive functions, making human behaviour more focused and flexible. In addition, many clinical studies highlight a relationship between poor inner speech and executive impairment in autism spectrum condition (ASC), but contrasting findings are reported. Here we directly investigate the latter issue through a previously implemented and validated computational model of the Wisconsin Card Sorting Test (WCST). In particular, the model was applied to explore potential individual differences in cognitive flexibility and inner speech contribution in autistic and neurotypical participants. Our model predicts that the use of inner speech could increase across the life span of neurotypical participants but would be reduced in autistic ones. Although we found more attentional failures (i.e., wrong behavioural rule switches) in autistic children/teenagers and more perseverative behaviours in autistic young/older adults, only autistic children and older adults exhibited lower performance (i.e., fewer consecutive correct rule switches) than matched control groups. Overall, our results corroborate the idea that the reduced use of inner speech could represent a disadvantage for autistic children and autistic older adults. Moreover, the results suggest that cognitive-behavioural therapies should focus on developing inner speech skills in autistic children, as this could provide cognitive support throughout their whole life span.


Review process underlying our selection of four experimental studies
The selection of the four experimental studies used in this work is the product of an extended literature review. In particular, we considered all studies that administered the WCST to autistic samples. We then applied three criteria to select the studies used here: (a) they adopted Heaton's version of the WCST, (b) they involved an autistic group and a matched control group, and (c) they reported at least the CC, PE, and NPE indices. Table S1 reports the 39 studies we initially considered, of which the last four are the ones we selected for this modelling work.
Table S1: Studies we considered during our selection process.

Computational details of the model

Environment
The cards we used are polygons with a unique combination of three visual dimensions (colour, form, and size), each having one of four possible attributes: colour (red, green, blue, yellow); form (square, circle, triangle, bar); size (large, medium-large, medium-small, small). There are thus $4^3 = 64$ combinations (cards) of attributes. We created a simulated environment composed of the objects (cards) that the model can visually explore (visual search) and on which it can execute a physical action (displacement).
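As a minimal sketch of the card set described above, the 64 cards can be enumerated as the Cartesian product of the three attribute lists. The names `COLOURS`, `FORMS`, `SIZES`, `cards`, and `matches` are illustrative assumptions, not identifiers from the original model.

```python
from itertools import product

# Attribute names as listed in the text.
COLOURS = ("red", "green", "blue", "yellow")
FORMS = ("square", "circle", "triangle", "bar")
SIZES = ("large", "medium-large", "medium-small", "small")

# Every card is one (colour, form, size) triple, giving 4 * 4 * 4 cards.
cards = list(product(COLOURS, FORMS, SIZES))


def matches(card_a, card_b, rule):
    """Return True if two cards share the attribute named by `rule`."""
    index = {"colour": 0, "form": 1, "size": 2}[rule]
    return card_a[index] == card_b[index]
```

The `matches` helper (hypothetical) expresses the three matching rules of the task: two cards match under a rule when they share the corresponding attribute.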

Visual sensor
The visual sensor returns a 28 × 28 × 3 RGBY pixel matrix, representing a limited portion of the whole virtual table. The visual sensor is actively moved, in a top-down way (visual search), toward the deck and then sequentially toward the target cards. Each matrix is then flattened into a vector of 2352 elements, which is the perceptual input to the model.
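The flattening step can be illustrated with a short sketch. The random `patch` is a hypothetical stand-in for a real sensor reading; only the 28 × 28 × 3 shape and the 2352-element result come from the text.

```python
import numpy as np

# Hypothetical stand-in for one sensor reading: a 28 x 28 patch with 3 channels.
rng = np.random.default_rng(0)
patch = rng.random((28, 28, 3))

# Flatten to the 2352-element vector (28 * 28 * 3) fed to the model.
x = patch.reshape(-1)
```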

Working memory
The working memory is formed by three recurrent units, each with a self-connection, whose activation takes a continuous value in [0, 1]. The activation of each unit decays toward a baseline (0.5) and is described by the following equation:

$$m_{l,t} = (1 - \phi)\, m_{l,t-1} + \phi\, \alpha$$

where $m_{l,t}$ is the value of a losing unit $l$ ($l \in \{1, 2, 3\}$; $l \neq s$, where $s$ is the selected unit considered below) at time $t$, $1 - \phi$ is the strength of the recurrent connection, and $\alpha = 0.5$ is the baseline value to which the memory unit activation converges. The activation of each unit represents the likelihood of selection that the system assigns to each of the three possible matching rules of the task, related to colour, form, and size. The parameter $\phi$ is a critical parameter of the model investigated in the simulations.
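The decay dynamics can be sketched as follows. This is a minimal illustration that assumes a leaky update of the form m_new = (1 − φ)·m_old + φ·α, which is consistent with the definitions above (recurrent strength 1 − φ, convergence to α = 0.5) but is an assumption about the exact form. The function name `decay` and the φ value used in the usage note are hypothetical.

```python
import numpy as np

ALPHA = 0.5  # baseline stated in the text


def decay(m, phi, selected, alpha=ALPHA):
    """Apply one decay step to every unit except the selected one.

    Assumed update for losing units: (1 - phi) * m + phi * alpha.
    """
    m = np.asarray(m, dtype=float)
    updated = (1.0 - phi) * m + phi * alpha
    # The selected unit is left unchanged here; in the model it is
    # updated by the feedback signal instead.
    updated[selected] = m[selected]
    return updated
```

For example, `decay([1.0, 0.0, 0.8], phi=0.2, selected=2)` moves the two losing units from 1.0 and 0.0 to 0.9 and 0.1, one step closer to the 0.5 baseline, while the selected unit stays at 0.8. A larger φ gives faster forgetting.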

Motivational component
This component is supported by a reinforcement learning algorithm. In particular, it receives the external feedback signal (a binary value in {0, 1}) and subsequently affects the activation of the unit encoding the last selected and used rule, as follows:

$$m_{s,t} = (1 - \mu)\, m_{s,t-1} + \mu\, r$$

where $m_{s,t}$ is the new activation of the rule unit, $s \in \{1, 2, 3\}$ is the index of the selected rule, $m_{s,t-1}$ is the current activation of the unit, $(1 - \mu)$ is the strength of the unit's recurrent connection, $\mu$ regulates the impact of the feedback on the memory, and $r$ is the feedback signal, equal to 1 in case of positive feedback (correct matching of the deck card and target card) and 0 otherwise. The parameter $\mu$ is set to a fixed value of 0.7 for positive feedback and to a variable value for negative feedback. The latter value is a critical parameter of the model investigated in the simulations.
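A minimal sketch of the feedback update follows, assuming the form m_new = (1 − µ)·m_old + µ·r implied by the definitions above. The function name `feedback_update` and the negative-feedback value 0.3 in the usage note are hypothetical; only the 0.7 value for positive feedback comes from the text.

```python
MU_POSITIVE = 0.7  # fixed value stated in the text for positive feedback


def feedback_update(m_prev, r, mu_negative):
    """Return the new activation of the selected rule unit.

    r is the binary feedback (1 = correct match, 0 = incorrect).
    mu_negative is the free parameter used when the feedback is negative.
    """
    mu = MU_POSITIVE if r == 1 else mu_negative
    return (1.0 - mu) * m_prev + mu * r
```

Starting from an activation of 0.5, positive feedback raises the unit to 0.85, whereas negative feedback with a hypothetical µ of 0.3 lowers it to 0.35. A small µ for negative feedback therefore makes the model slow to abandon a rule, which is one way perseveration can arise.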

Hierarchical perceptual component
This component is supported by a deep generative model, specifically a Deep Belief Network (DBN, [40]) composed of two stacked Restricted Boltzmann Machines (RBMs). We trained the first RBM, composed of the input layer and the first hidden layer of the DBN, with the classical unsupervised learning algorithm for this model (contrastive divergence, [41]). We trained the second RBM, composed of the first and second hidden layers of the DBN, with a modified version of the original algorithm that alters the reconstructions of the original inputs to obtain prototypical representations of the input image features on which the system focuses (e.g., with a focus on colour, a red triangle given as input is reconstructed as a shapeless red blob). This modification causes three groups of units to emerge in the last layer of the DBN (its second hidden layer), each corresponding to specific visual categories of the input (first four units for colour: red, green, blue, yellow; second four units for form: square, circle, bar, triangle; third four units for size: small, medium-small, medium-large, large). The model is able to 'reconstruct' ('generate') the original input through a bidirectional activation from the input layer to the hidden layer and then back to the input layer. In particular, the selector and manipulator considered below are able to select one category (one group of four units), and one attribute within it (one neural unit), to produce the prototypical rule-based reconstruction of images mentioned above.
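For readers unfamiliar with contrastive divergence, the sketch below shows one standard CD-1 update for an RBM with binary (Bernoulli) units. It illustrates only the classical algorithm mentioned for the first RBM, not the modified algorithm used for the second RBM, and it is not the authors' implementation. The hidden-layer size (50), learning rate, and random input batch are hypothetical; only the 2352-element input size comes from the text.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cd1_step(v0, W, b_vis, b_hid, rng, lr=0.01):
    """One CD-1 update on a batch v0 of shape (batch, n_visible)."""
    # Positive phase: hidden probabilities and a binary sample given the data.
    h0_prob = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Negative phase: reconstruct the visibles, then recompute hidden probabilities.
    v1_prob = sigmoid(h0 @ W.T + b_vis)
    h1_prob = sigmoid(v1_prob @ W + b_hid)
    # Gradient estimate: data correlations minus reconstruction correlations.
    batch = v0.shape[0]
    W = W + lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / batch
    b_vis = b_vis + lr * (v0 - v1_prob).mean(axis=0)
    b_hid = b_hid + lr * (h0_prob - h1_prob).mean(axis=0)
    return W, b_vis, b_hid, v1_prob


rng = np.random.default_rng(0)
n_visible, n_hidden = 2352, 50  # 2352 from the text; 50 hidden units is hypothetical
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_vis = np.zeros(n_visible)
b_hid = np.zeros(n_hidden)
batch = rng.random((8, n_visible))  # random stand-in for flattened card images
W, b_vis, b_hid, recon = cd1_step(batch, W, b_vis, b_hid, rng)
```

The returned `recon` is the reconstruction obtained by the bidirectional pass (input to hidden and back to input) that the text describes.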

Selector and manipulator components
The selector is supported by a softmax function, a winner-take-all (WTA) function that receives the values from the working memory as input and chooses the matching rule as follows:

$$Pr(s = i) = \frac{e^{m_i / \tau}}{\sum_{j=1}^{3} e^{m_j / \tau}}$$

where $m_i$ is the working-memory activation of rule unit $i$ and $\tau$ is the temperature, a parameter manipulated in the simulations. A high value of $\tau$ causes high randomness/exploration in the decisions. The probabilities $Pr(\cdot)$, summing to 1, are used to stochastically select the matching rule to use. The manipulator is composed of two layers of 3 units, linked by one-to-one negative projections. Each unit of the second layer is always active and has negative projections to a specific group of the last layer of the perceptual component, so the activation of a specific unit in the first layer of the manipulator disinhibits the corresponding group of the last layer of the perceptual component. Moreover, the manipulator implements a hard-max function that selects only one unit (attribute) within each group (category) of four units.
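The stochastic rule selection can be sketched with a standard temperature softmax over the three working-memory activations. The function names and the example values in the usage note are illustrative assumptions. The max-subtraction is a common numerical-stability step and does not change the resulting probabilities.

```python
import numpy as np


def rule_probabilities(m, tau):
    """Softmax over working-memory activations m with temperature tau."""
    z = np.asarray(m, dtype=float) / tau
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def select_rule(m, tau, rng):
    """Stochastically pick a rule index according to the softmax probabilities."""
    return int(rng.choice(len(m), p=rule_probabilities(m, tau)))
```

With activations `[0.9, 0.5, 0.1]`, a low temperature such as 0.05 concentrates almost all of the probability on the first rule, whereas a high temperature such as 5.0 yields nearly uniform probabilities and hence more exploratory choices.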

Verbal component
This component is supported by a multi-layer perceptron (MLP), formed by 4 input units, 10 sigmoid hidden units, and 3 linear output units. In particular, it receives one-to-one connections from the selector units and sends one-to-one connections to the WM units. This process is implemented as follows: where $m_t$ is the new activation of a WM rule unit, $m_{t-1}$ is the current activation of the WM unit, λ represents the strengths of the one-to-one connection weights linking the language component output-layer

Fitting results and comparison between the behaviour of the models and of human groups: models validation details
We adopted the same procedure used in [42] to run the parameter search, aimed at finding the model parameters that produce the behaviour best fitting that of the human populations. In particular, we randomly sampled 3,000 combinations of parameters, each drawn from a uniform distribution in the following ranges: φ:

The fit of each parameter combination was measured by the mean squared error:

$$MSE = \frac{1}{n}\, \lVert y - \hat{y} \rVert_2^2$$

where $y$ is the vector of mean indices of the human group, $\hat{y}$ is the vector of mean indices of the considered parameter combination, $\lVert \cdot \rVert_2^2$ is the square of the L2 norm, and $n$ is the length of the vectors.
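The search procedure can be sketched as a uniform random search that scores each sampled parameter combination by the mean squared error (squared L2 norm of the difference divided by the vector length). The function names, the parameter ranges, the toy `toy_simulate` model, and the target vector are all hypothetical placeholders. They are not the ranges or the model used in the paper.

```python
import numpy as np


def mse(y_human, y_model):
    """Mean squared error: squared L2 norm of the difference divided by n."""
    y_human = np.asarray(y_human, dtype=float)
    y_model = np.asarray(y_model, dtype=float)
    return float(np.sum((y_human - y_model) ** 2) / y_human.size)


def random_search(simulate, y_human, ranges, n_samples, rng):
    """Sample parameter combinations uniformly and keep the best-fitting one."""
    best_params, best_error = None, np.inf
    for _ in range(n_samples):
        params = {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
        error = mse(y_human, simulate(params))
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


# Toy usage: hypothetical ranges and a hypothetical stand-in for the model.
rng = np.random.default_rng(0)
toy_ranges = {"phi": (0.0, 1.0), "mu": (0.0, 1.0)}
toy_simulate = lambda p: [p["phi"], p["mu"], p["phi"] + p["mu"]]
best, err = random_search(toy_simulate, [0.2, 0.4, 0.6], toy_ranges, 3000, rng)
```

In the actual procedure, `simulate` would run the full model on the task and return its mean behavioural indices, which are then compared with those of each human group.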